A deep neural network model for multi-view human activity recognition

Multiple cameras are used to resolve the occlusion problems that often occur in single-view human activity recognition. Building on the success of representation learning with deep neural networks (DNNs), recent works have proposed DNN models to estimate human activity from multi-view inputs. However, currently available datasets are inadequate for training DNN models to a high accuracy rate. To address this issue, this study presents a DNN model, trained by employing transfer learning and shared-weight techniques, to classify human activity from multiple cameras. The model comprised pre-trained convolutional neural networks (CNNs), attention layers, long short-term memory networks with residual learning (LSTMRes), and Softmax layers. The experimental results suggested that the proposed model could achieve promising performance on challenging multi-view human activity recognition (MVHAR) datasets: IXMAS (97.27%) and i3DPost (96.87%). A competitive recognition rate was also observed in online classification.

Current multi-view approaches comprise conventional computer vision (CV) or DNN methods [12,13]. The conventional methods require sophisticated feature extraction to identify informative features from raw data [14][15][16]. The feature extractor usually works independently of the classifier [17,18]. Studies based on this approach focus either on the classifier or on feature engineering [19,20].
Representing human action from multiple views is the major challenge in feature engineering studies for multi-view action recognition. Previous studies have encoded human movement as low-level representations such as histograms of oriented gradients (HoG) [14], silhouettes [15], and optical flow [21] extracted from RGB images. Afterwards, they were used for
• Evaluation of the model's performance during actual application in online classification indicated that longer image sequences produced higher recognition rates.

Related work
In the last decade, conventional CV approaches have dominated the MVHAR field; they represented human body configuration using 2D, 3D, and 4D models. Methods using 2D models extracted silhouettes and optical flow from sequences of images for direct classification [9,15] or transformation to higher-level features [1,18,38,39]. High-level features such as silhouette contour points and centers of mass [15] showed superiority over other methods employing 2D data [18] to encode movement in human action. The conventional CV approaches with 3D/4D models required a sophisticated algorithm to extract informative features from RGB images [8,9,17,21]. The existing approaches either directly concatenated all features from multiple views [8,9,22] or weighted multiple hypotheses [13,21] from those inputs to discriminate the human activity. Pehlivan et al. [8] encoded sequences of silhouettes from multiple views as cylindrical shapes with different rotations, while Weinland et al. [22] extended motion history (MH) determined from a single view to a motion history volume that combined MH from multiple views. With a six-step feature extractor, Holte et al. [9] determined 4D spatio-temporal interest points and local descriptions of 3D motion features from image sequences. Another study [40] combined local and global features with a self-similarity matrix; the study did not require a 3D model to represent subjects' activity from multiple views. These feature-fusion approaches classified human activity by feeding the combined features to a classifier algorithm.
In contrast with feature fusion, score fusion involves separate treatment of input features, followed by a combination of hypotheses from all inputs using weighting functions. Previous works have implemented score fusion using the arithmetic mean [30], a fixed-weight operation [13], and a data-driven adaptive weight algorithm [21]. The arithmetic mean assumed that no prior knowledge of informative views was available and therefore treated all confidence scores equally [30]. The fixed-weight operation involved learning to identify informative inputs from the data [13], while the adaptive algorithm weighted the hypotheses with different masks during inference [21].
This study aimed to resolve the multi-model issue and investigate the performance of the DNNs model with feature and score fusion in MVHAR. We shared a single CNNs block and LSTMRes block across multiple inputs and then fused the multiple hypotheses produced by the model using score fusion to predict human activity. Previous researchers combined scores with weighting functions whose parameter values were determined during training [13] or inference [21]. In contrast, the proposed model assumed no prior knowledge of informative views. Therefore, prediction scores from individual view inputs were treated equally with the arithmetic mean or weighted with the geometric mean during training and inference.
The proposed model employed RGB images as inputs and did not combine the latent variables with another modality. The work examined the proposed model's performance with the IXMAS [22] and i3DPost [31] datasets and evaluated its implementation in an online scenario.

Proposed DNNs model
The proposed DNNs model comprised a pre-trained CNNs, an attention layer, an RNNs layer, and a score-fusion layer [26] (see Fig 1), with multiple inputs and outputs representing multiple views and actions.

Pre-trained CNNs
Pre-trained CNNs were used to extract spatial information from RGB images in the proposed model. The outputs of the pre-trained models' intermediate or final layers were used to extract features from sequences of images. The output comprised a feature map $f$ of shape $H \times W \times C$, where $H$, $W$, and $C$ are height, width, and channel, respectively. Hence, the feature vector for $T$ time steps was $F = [f_1, \ldots, f_T]$.
This study involved an examination of the pre-trained models VGG-19 and VGG-16 [46], comprising five blocks with different numbers of CNNs. VGG-16 comprised 12 stacked CNNs: two in the 1st, 2nd, and 3rd blocks, and three in the 4th and 5th blocks, while VGG-19 had an extra CNN in the 4th and 5th blocks, making 14 stacked CNNs in total. This paper refers to the I-th CNN in the N-th block as blockN_convI.

Attention layer
Since it was assumed that significant transformation occurred only in certain parts of image sequences when subjects performed actions, the proposed model filtered out uninformative features by employing an attention layer [47] that weighted important features with higher probability and the others with lower probability.
Given the feature vector $F$ of shape $T \times G$, the attention mask was computed by averaging attention scores over $G$. The first step to determine relevant features was to estimate the attention probability at each time step for the $G$ dimension. For the feature map at the $t$-th time step, $f_t$, the attention probability was given by
\[ s_t = g_{att}(f_t; \theta_t), \qquad \alpha_t = \mathrm{Softmax}(s_t), \]
where $g_{att}$ was an attention network with weight $\theta_t$, and $s_t$ was the attention score map for the feature map. The attention score $\alpha_t$ was the probability produced by the Softmax function, assigning the subject of interest a higher probability than the rest. Dense, convolutional, and RNN layers can be used as attention networks [48]; the proposed model employed a dense layer for the attention network because, in the preliminary experiment, we found that CNNs and RNNs caused over-fitting. After computing the attention probability at each time step, the relevant features were calculated using
\[ \hat{f}_t = \alpha_t \odot f_t \]

PLOS ONE
where $\odot$ represents the element-wise operator or the Hadamard product [49], weighting the features extracted from the pre-trained CNNs with $\alpha_t$.
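The attention step described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a single dense layer as the attention network and a Softmax over the G feature dimensions; the helper names (`softmax`, `dense`, `attend`) and the toy weights are hypothetical and are not taken from the original implementation.

```python
import math

def softmax(scores):
    """Numerically stable Softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, weights, bias):
    """Dense layer: `weights` is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def attend(feature, weights, bias):
    """Return (attention probabilities, weighted feature) for one time step."""
    scores = dense(feature, weights, bias)               # attention score map s_t
    alpha = softmax(scores)                              # attention probability alpha_t
    weighted = [a * f for a, f in zip(alpha, feature)]   # Hadamard product alpha_t * f_t
    return alpha, weighted

# Toy example with G = 3 and identity weights (assumed values).
feature = [1.0, 2.0, 3.0]
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bias = [0.0, 0.0, 0.0]
alpha, weighted = attend(feature, weights, bias)
```

In this toy setting the largest feature receives the largest attention probability, so it dominates the weighted output while the smaller features are attenuated.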

Residual learning in LSTM
A long short-term memory (LSTM) architecture [50] was proposed to solve the problem of vanishing and exploding gradients associated with conventional recurrent neural networks (RNNs) [26]. The architecture, however, can still suffer from the degradation problem caused by deeper neural network structures [51]. Residual learning was proposed to tackle this issue by introducing a shortcut connection from an earlier layer to a later one, which helps the earlier layer receive a "fresh" gradient from the later one during backpropagation [52].
In contrast to the highway network approach [53], the residual learning formulation [52] involved an identity shortcut to ensure ongoing learning. The residual function $H(z_i)$ could be expressed as
\[ H(z_i) = F(z_i, W_i) + W_s z_{i-m}, \]
where $F(z_i, W_i)$ and $z_{i-m}$ represent the original mapping and the output from the earlier layer, respectively, and $W_s$ was a linear projection, realized via linear mapping, that was used when the dimensions of $F(z_i, W_i)$ and $z_{i-m}$ were unequal.
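The shortcut connection with an optional linear projection can be illustrated as follows. This is a sketch under stated assumptions: `mapped` stands for the output of the original mapping, `earlier` for the output of the earlier layer, and `projection` for the linear projection used only when the two dimensions differ; the function names and numbers are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def residual(mapped, earlier, projection=None):
    """Add the (optionally projected) earlier-layer output to the mapped output."""
    shortcut = earlier if projection is None else matvec(projection, earlier)
    if len(shortcut) != len(mapped):
        raise ValueError("shortcut and mapping must have the same dimension")
    return [m + s for m, s in zip(mapped, shortcut)]

# Identity shortcut: dimensions already match.
same_dim = residual([1.0, 2.0], [10.0, 20.0])

# Projected shortcut: a 3-dimensional earlier output mapped to 2 dimensions.
projected = residual([1.0, 2.0], [1.0, 1.0, 1.0],
                     projection=[[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
```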
In LSTM, residual mapping could be accomplished by introducing a shortcut connection to the adjacent layer, from layer t to t + 1 [54] (Eq 6), or by establishing a connection to the memory cell [23] (Eq 7, implementation: Fig 2).
Here, $o_t$, $C_t$, and $h_t$ represent the output gate, memory cell, and hidden units, respectively.

Shared weight LSTMRes
The model in our previous work used Multiple Sequence LSTMRes (MSLSTMRes) to decode temporal deformation changes in features from multiple cameras [23]. MSLSTMRes experimentally outperformed baseline topology at the expense of computational time.
To address that issue, the proposed model used shared weights of pre-trained CNNs and stacked LSTMRes (comprising two LSTMRes with 512 units) across inputs from all cameras. Previous work applied a shared hidden-layer network to find similarities in speech and text [55]. This work employed a shared-weight model to learn transformation and similarity among features from multiple cameras, enabling late fusion using only a single model.
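The idea of sharing one set of weights across all cameras can be sketched with a toy model. This is only an illustration under loud assumptions: `SharedViewModel` is a hypothetical stand-in that replaces the pre-trained CNNs and stacked LSTMRes with temporal average pooling and a single linear layer; the point is that every view is passed through the same parameters and yields its own prediction scores for late fusion.

```python
import math

def softmax(scores):
    """Numerically stable Softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class SharedViewModel:
    """Toy stand-in for the shared CNN + LSTMRes stack (one parameter set for all views)."""

    def __init__(self, class_weights):
        self.class_weights = class_weights  # one row of weights per action class

    def predict(self, sequence):
        """Pool features over time, then apply the shared linear layer and Softmax."""
        steps = len(sequence)
        pooled = [sum(column) / steps for column in zip(*sequence)]
        logits = [sum(w * v for w, v in zip(row, pooled))
                  for row in self.class_weights]
        return softmax(logits)

# A single model instance serves every camera (assumed toy data, 2 features, 2 classes).
model = SharedViewModel([[1.0, 0.0], [0.0, 1.0]])
views = {
    "cam1": [[2.0, 0.0], [2.0, 0.0]],
    "cam2": [[0.0, 1.0], [0.0, 3.0]],
}
scores = {name: model.predict(sequence) for name, sequence in views.items()}
```

Because the parameter count does not grow with the number of cameras, adding a view only adds one more forward pass through the same model.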

Score fusion
Arithmetic or geometric means were used to combine prediction scores from the Softmax layers. With the former, scores from all cameras were treated as a mixture, while the latter allowed the prediction from a single camera to veto the other outcomes. The proposed model calculated the final prediction scores using the arithmetic mean (Eq 8) or the geometric mean (Eq 9).
Here, $y_{ac}$ represents the probability score of an action $a$ from camera $c$. $M$ and $N$ are the total numbers of actions and cameras, respectively.
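The difference between the two fusion rules, including the veto behaviour of the geometric mean, can be seen in a small sketch. The code below is an assumption-laden illustration rather than a reproduction of Eqs 8 and 9: `scores[c][a]` holds the Softmax score of action `a` from camera `c`, the renormalization of the geometric mean over actions and the small clamp `eps` are choices made here for numerical convenience.

```python
import math

def arithmetic_fusion(scores):
    """Average the scores of each action over all cameras."""
    cameras = len(scores)
    return [sum(column) / cameras for column in zip(*scores)]

def geometric_fusion(scores, eps=1e-12):
    """Geometric mean per action, renormalized over actions (assumed normalization)."""
    cameras = len(scores)
    fused = [math.exp(sum(math.log(max(y, eps)) for y in column) / cameras)
             for column in zip(*scores)]
    total = sum(fused)
    return [value / total for value in fused]

# Two cameras favour action 0; the third assigns it zero probability.
scores = [[0.9, 0.1], [0.9, 0.1], [0.0, 1.0]]
arithmetic = arithmetic_fusion(scores)
geometric = geometric_fusion(scores)
```

With the arithmetic mean the majority of cameras decides (action 0 wins), whereas with the geometric mean the single zero score from the third camera suppresses action 0 and action 1 wins.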

Datasets and evaluation metrics
The IXMAS dataset [22] is a benchmark in MVHAR algorithm evaluation that comprises videos of 12 subjects performing 13 actions: watch checking, arms crossing, head-scratching, sitting, getting up, turning around, walking, waving, punching, kicking, pointing, picking something up, and throwing. Videos were recorded using five cameras at 23 fps. Subjects performed each action three times with free positioning and orientation. The i3DPost dataset [31] was recorded using eight synchronized cameras with a resolution of 1920x1080 and 25Hz progressive scan. The eight subjects performed 12 actions (walking, running, jumping, bending, waving, jumping in place, sitting-standing, running-falling, walking-sitting, running-jumping-walking, hand-shaking, and pulling), creating 96 multi-view videos of human activity.
The proposed model's performance was evaluated with categorical cross-entropy loss, classification accuracy, and F1-score metrics. The accuracy rate was computed by averaging top-1 accuracy over the given data, while the F1-score was the average F1-score over all classes. The p-value was computed using Student's t-test [56].
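For reference, top-1 accuracy and the class-averaged (macro) F1-score can be computed as in the following sketch. The function names and the toy labels are hypothetical, and the F1-score of a class with no true or predicted samples is set to zero here by assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, label):
    """F1-score of one class from true positives, false positives, false negatives."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted average of the per-class F1-scores."""
    labels = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)

# Toy example with three action classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```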

Pre-processing and learning
To reduce distortion in images and ensure the features were on the same scale, RGB-normalization and feature standardization were performed in pre-processing. The mean and standard deviation were computed individually for each dataset to standardize the image values. Gamma correction was applied to the images of the IXMAS dataset; the gamma value was 1.5.
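Gamma correction and standardization can be sketched as below. Several details are assumptions, since the text does not specify them: pixel values are taken to be 8-bit intensities, the gamma convention is `out = 255 * (in / 255) ** (1 / gamma)`, and the dataset statistics are the population mean and standard deviation of a flat list of values.

```python
import math

def gamma_correct(pixels, gamma=1.5, max_value=255.0):
    """Apply gamma correction to intensities in [0, max_value] (assumed convention)."""
    return [max_value * (p / max_value) ** (1.0 / gamma) for p in pixels]

def mean_std(values):
    """Population mean and standard deviation of a list of values."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

def standardize(values, mean, std):
    """Shift to zero mean and scale to unit standard deviation."""
    return [(v - mean) / std for v in values]

# Toy intensities standing in for one dataset.
corrected = gamma_correct([0.0, 64.0, 255.0])
dataset_mean, dataset_std = mean_std([10.0, 20.0, 30.0])
standardized = standardize([10.0, 20.0, 30.0], dataset_mean, dataset_std)
```

Under the assumed convention, a gamma value above 1 brightens mid-range intensities while leaving the extremes unchanged.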
In the experiments, the proposed model was trained with three scenarios (Table 1). In all scenarios, backpropagation with the RMSProp optimizer [57] was used. Glorot uniform [58] and orthogonal [59] initializers were used to initialize the kernel and recurrent weights, respectively; the bias values were initialized to zero. As this study used pre-trained CNNs in all experiments, no parameter initialization was applied to the CNNs.
Evaluation in scenarios II and III involved leave-one-subject-out cross-validation, while scenario I used train-test evaluation. We also applied early stopping during training in scenarios II and III.
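A subject-wise cross-validation split of this kind can be generated as in the sketch below, in which each fold holds out all samples of one subject for testing. The sample representation as `(subject_id, item)` pairs and the function name are hypothetical.

```python
def leave_one_subject_out(samples):
    """Yield (held_out_subject, train_items, test_items) for every subject."""
    subjects = sorted({subject for subject, _ in samples})
    for held_out in subjects:
        train = [item for subject, item in samples if subject != held_out]
        test = [item for subject, item in samples if subject == held_out]
        yield held_out, train, test

# Toy data: three subjects, four clips.
samples = [("s1", "a"), ("s1", "b"), ("s2", "c"), ("s3", "d")]
folds = list(leave_one_subject_out(samples))
```

Splitting by subject rather than by clip keeps clips of the same person out of both the training and the test set of a fold.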
Action image sequences were trimmed to 22 frames for scenario I and 20 frames for the other scenarios; experimental results showed that using 20 frames resulted in higher accuracy of the proposed model. To prevent over-fitting, data augmentation was performed in scenario III by sub-sampling a frame sequence at different frequencies. The hyper-parameter values were determined via grid search.
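Augmentation by temporal sub-sampling can be sketched as follows: the same video yields several fixed-length clips by keeping every `step`-th frame. The step values, the starting offset, and the policy of discarding clips that come out too short are assumptions made for this illustration.

```python
def subsample(frames, step, length, start=0):
    """Keep every `step`-th frame from `start`; return None if too few frames remain."""
    picked = frames[start::step][:length]
    return picked if len(picked) == length else None

def augment(frames, length, steps=(1, 2, 3)):
    """Create one clip per sampling step, skipping steps that yield short clips."""
    clips = []
    for step in steps:
        clip = subsample(frames, step, length)
        if clip is not None:
            clips.append(clip)
    return clips

# Frame indices stand in for images; 20 frames per clip as in scenarios II and III.
frames = list(range(60))
clips = augment(frames, length=20)
short_video_clips = augment(list(range(30)), length=20)
```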

Exploration studies
This section details the results of exploration studies using the IXMAS dataset. The experiments included:
1. Comparison of LSTMRes with LSTMResKim.
2. Evaluation of pre-trained CNNs; other models, such as ResNet [52] and Inception [61], were not used because they impaired the recognition rate of the proposed model.
3. Evaluation of the multi-model approach and shared-weight technique.
4. Comparison of features fusion with score fusion using arithmetic and geometric means.
In every experiment, we carried the best-performing structure forward to the succeeding experiment.
LSTMRes vs LSTMResKim. Fig 3 depicts the performances of LSTM, LSTMRes, and LSTMResKim on IXMAS based on training and validation errors. The results suggested that the validation errors of the proposed model with LSTMRes were lower than those with LSTMResKim. The training error with LSTMRes decreased steadily throughout the learning process. Instability, however, appeared in that of LSTMResKim after the 60th iteration.
In contrast, the LSTMRes exhibited slightly lower training and validation loss than LSTM and ConvLSTM. The performances of LSTM and LSTMRes were nearly identical. These outcomes indicated that performing residual learning in the LSTM memory cell provided an insignificant improvement to the model. In consideration of these results, we used LSTMRes for the rest of the experiments reported here.
Pre-trained CNNs. VGG-16 and VGG-19 models trained with the ImageNet dataset were examined as CNN blocks for the proposed model. We conducted three experiments using the intermediate block4_pool and the last block5_pool layers to find an appropriate pre-trained model and clarify the effect of fine-tuning. The first experiment used the intermediate layer as a feature extractor without fine-tuning the parameters, while the second applied fine-tuning. The last experiment was conducted by fine-tuning the CNNs (from block4_conv2 to block5_conv3).
Shared weight, no MSLSTMRes. We previously found that MSLSTMRes yielded higher accuracy than the baseline model [23]. However, the recognition rate came at the expense of computational time and the number of parameters.
Given the benefits of shared-layer DNN in language modeling [55], we investigated related effects in multi-view action recognition, sharing the pre-trained VGG-16 and stacked LSTMRes of the proposed model across inputs from all cameras. Different attention layers were used for different views, and feature fusion was used to compute action probability.
The proposed model obtained a 1.01% (n = 396, p = 0.617) higher accuracy than with the use of MSLSTMRes (Fig 5). Shared-weight application also resulted in fewer parameters (the proposed model: 70,304,393, [23]: 351,323,711) and lower complexity with the proposed model, improving computation time.
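The size of the parameter saving follows directly from the two counts reported above; the short calculation below makes the ratio explicit. Only the two parameter counts come from the text; the variable names are illustrative.

```python
# Parameter counts reported in the text.
shared_params = 70_304_393        # proposed shared-weight model
multi_model_params = 351_323_711  # MSLSTMRes-based model from [23]

ratio = multi_model_params / shared_params
reduction = 1 - shared_params / multi_model_params

print(f"ratio: {ratio:.2f}, reduction: {reduction:.1%}")
```

The shared-weight model therefore uses roughly one fifth of the parameters of the earlier model.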
Score fusion. This experiment compared the proposed approach's performance when using feature fusion and score fusion. Feature fusion estimates action probability using a combination of the features from multiple cameras. Score fusion, in contrast, combines the prediction scores from multi-view inputs using the arithmetic or geometric mean.
This experiment used pre-trained VGG-16 and shared-weight LSTMRes as the CNN and RNN layers of the proposed model, respectively. The models were trained with scenario II (Section Pre-processing and learning). The proposed DNNs model exhibited an average accuracy of 97.22% with the use of the arithmetic mean (Fig 6), which was 1.51% higher than with feature fusion. Score fusion with the geometric mean had the opposite effect, decreasing the proposed model's accuracy rate by 1.77%. These results suggested that the proposed DNNs model performed better when using the arithmetic mean for score fusion.
(Table 2), while the lowest improvement (1.01%) was gained when replacing MSLSTMRes with shared LSTMRes. Although a higher resolution than in our previous research [23] was used here (128 × 128 vs 73 × 73 pixels), the recognition rate with the proposed model increased by only 0.25%, which was insignificant considering the small dataset.
Application of the optimized configuration also increased the accuracy of the proposed model in recognizing actions performed by hands (e.g., watch checking, arms crossing, waving, and punching) (Fig 7). The highest improvement (25%) was achieved in the identification of the "wave" action. We used this new configuration for the next experiments: comparison of the proposed model with a single-input model and with state-of-the-art methods.

Comparison with a single-input model
We performed an experiment on IXMAS with 13 actions to evaluate the proposed model's performance using single-view and multi-view inputs. Comparison of the results (Fig 8) demonstrated that combining information from multi-view inputs produced a significant improvement in accuracy rate of 20.17±8.57% (n = 396, average p < 0.05). Compared to the multi-view application, the use of input from Cam5 (top view) alone produced a 37.18% lower recognition rate, while each of the other views yielded a 15.92±1.23% lower accuracy rate.

Comparison with state-of-the-art methods
We compared the proposed model to state-of-the-art methods on IXMAS and i3DPost (Tables 3 and 4). Note that their results were not reproduced and that the proposed model used 2D RGB images as inputs. Following the experimental protocol of previous studies, we evaluated the proposed model on the IXMAS dataset with 11 subjects performing 10 actions. We used data of all subjects in the evaluation of 13 actions on IXMAS and of 10 and 12 actions on i3DPost. This experiment used learning scenario III (Section Pre-processing and learning).
In the tables, the "Input" column shows the type of features used in the methods. Evaluation was performed with the proposed model based on actions performed by one subject and two subjects ("12 Actions" column). Mygdalis et al. [63] validated their model's performance using 3-fold cross-validation. The marker symbol in the tables denotes a DNN-based approach or a method employing DNN-based features. https://doi.org/10.1371/journal.pone.0262181.t004

In the evaluation with IXMAS, the proposed model outperformed all 2D methods in recognizing 11 actions by 12.05% on average (Table 3). Its performance was higher than that of methods employing 3D feature representations [8,22] but slightly lower than the outcomes reported in [17]. The proposed model also obtained a 4.46% higher accuracy rate than other DNN models using 2D inputs and achieved results competitive with the models using 2D + optical flow inputs. However, the accuracy rate of the proposed model was 2.33% lower than that of the adaptive score fusion method.
In addition, the model produced a recognition rate of 96.37% in classifying 13 actions, outperforming Pehlivan et al. [8] with the use of 3D features. However, the proposed DNN model's recognition rate was still lower than 4D models [9].
The performance of the proposed DNN model in recognizing 10 actions on i3DPost was comparable to state-of-the-art methods (Table 4). The proposed model often misclassified actions with similar body configurations, such as jumping and bending, and confused single and combined actions, such as "walking" and "running-jumping-walking" (Fig 9). The model achieved higher performance in classifying 10 actions and 2 interactions.
The proposed model obtained an average F1-score higher than 0.9 for all classes with all datasets (Table 5). The proposed model achieved the lowest F1-score when evaluated with 10 actions on i3DPost and attained the highest F1-score when evaluated with 11 actions on IXMAS.

Online classification
In the online scenario, we did not segment individual action sequences based on the action labels, but used a sliding window to create clips from video content. The proposed model therefore had to classify early and ambiguous actions (Fig 10) arising from unfinished sequences of actions or transition phases between actions. This study investigated how the length of a clip affected the performance of the proposed model by setting the sliding time window t to 10, 20, 30, 40, and 50. The proposed model was trained based on learning scenario II (Section Pre-processing and learning) to estimate subjects' activity in each frame. The final prediction was the average of the probability scores over the sequence of images. The experimental results (Fig 11) show that the highest accuracy and F1-score were attained with t = 50. The accuracy and F1-score of the proposed model increased with longer sliding window values. However, the increase was not monotonic, as the recognition rate was 0.69% lower at t = 20 than at t = 10.
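The clip creation and score averaging used in the online scenario can be sketched as follows. This is a minimal illustration with assumed details: a stride of one frame, per-frame scores given as lists of class probabilities, and hypothetical function names.

```python
def sliding_windows(frames, size, stride=1):
    """Cut a frame sequence into overlapping clips of `size` frames."""
    return [frames[i:i + size]
            for i in range(0, len(frames) - size + 1, stride)]

def average_scores(per_frame_scores):
    """Average the class probabilities over all frames of a clip."""
    count = len(per_frame_scores)
    return [sum(column) / count for column in zip(*per_frame_scores)]

def classify_clip(per_frame_scores):
    """Return the index of the class with the highest averaged score."""
    fused = average_scores(per_frame_scores)
    return max(range(len(fused)), key=fused.__getitem__)

# A 60-frame video cut with the longest window evaluated in the text (t = 50).
video = list(range(60))
clips = sliding_windows(video, size=50)

# Toy per-frame scores for one clip with two classes.
frame_scores = [[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
label = classify_clip(frame_scores)
```

Longer windows produce fewer clips per video, but each clip covers more of an action, which is consistent with the higher recognition rates reported for longer windows.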
The imbalanced dataset (Fig 12(A)) did not impair the overall performance of the proposed model, which achieved F1-scores higher than 0.6 in all scenarios. In addition, the proposed model classified sitting-down, getting-up, and picking-up actions with an accuracy rate of over 80%, even though the share of data for these actions was lower than for the others. However, the experimental results for t = 50 (Fig 12(B)) show that the proposed model had difficulty differentiating actions performed only by hands (e.g., head-scratching, waving, punching, pointing, and throwing), for which the recognition rate was less than 70%.

Discussion
When a subject performs an activity in a dynamic environment, self-occlusion and inter-object occlusion may occur. Multi-view human activity recognition helps to prevent a complete loss of information when occlusion appears in a single camera by providing information from other cameras [5,11]. Previous findings have indicated that employing multiple views increased the recognition rate of human activity in a dynamic environment [6,7].
This study presents a novel DNN model employing shared weights and score fusion to classify human activity from multiple views. The experimental results suggested that the proposed model achieved optimal performance with VGG-16 as the pre-trained CNNs, shared-weight LSTMRes as the RNN layer, and the arithmetic mean as score fusion. Exploratory studies showed that fine-tuning of pre-trained CNN parameters may not improve accuracy; fine-tuning block4 of VGG-19 impaired the performance of the proposed model. We also found that the model was better co-adapted with shallower pre-trained CNNs, as shown by the improved performance for transfer learning with VGG-16. Those outcomes were attributed to VGG-16's output, which produced less domain-specific features than VGG-19.
Comparison between the proposed LSTMRes and the method detailed by Kim [54] showed that the shortcut connection between adjacent layer outputs of LSTM introduced instability, impairing performance. These results were consistent with those reported by Krueger et al. [64] and suggested that residual learning in memory cells brought an insignificant improvement in the training and generalization performance of LSTM. The findings of this study also suggested that even with little training data, the proposed DNN model could attain competitive performance. The results showed that transfer learning and score fusion with the arithmetic mean improved the model's performance. Moreover, compared to other DNN-based methods, the proposed model outperformed models using 2D modality [33] and achieved a lower accuracy rate than methods employing more complex modalities, such as optical flow [21,34] and skeleton data [44].
Another major finding of this study was that the proposed model required a longer sliding window to attain optimal performance. That contradicts the results of previous work [65] that found a brief sequence was sufficient for the evaluation of basic human actions. One interpretation of these findings is that the applicable number of frames to recognize human activities may differ from case to case; we used different datasets from Schindler et al. [65]. In our experiment, short sequences resulted in ambiguous clips, impairing the performance of the proposed model.
As seen in previous works [1,6], this study found that combining information from multiple views resulted in a higher accuracy rate of the proposed model in MVHAR. This suggested that additional information from another view mitigated information loss caused by occlusion. The results also suggested that the proposed model could filter out uninformative features, since the recognition rate did not decline when input from Cam5 was combined with other views; using single-view input from Cam5 resulted in impaired performance of the proposed model.
Despite the promising results achieved with the proposed model compared to state-of-the-art methods, this study had several limitations. First, the model did not achieve satisfactory results in the online scenario, which required classification of sequences of ambiguous actions. Second, the study only evaluated the proposed model's performance with two benchmark datasets, which comprised fewer than 15 subjects performing basic activities. Hence, the experimental results remain preliminary. Last, only self-occlusion was observed in the datasets. Accordingly, the proposed model's performance under mutual occlusion is unclear. Further study is needed to evaluate and improve model performance in an online scenario involving more subjects. It would also be of interest to evaluate the model with more benchmark datasets comprising complex activities performed in various situations, such as the CASIA [66], UCF101 [67], MOD20 [68], and HMDB51 [69] datasets.